Version: 7.8

Best Practices for Achieving Maximum Availability

Overview

Maximizing system and service availability requires coordinating several elements of an IT environment. The application or service itself must be fault tolerant and follow good design principles for redundancy and failover, but this is only the start; a more holistic approach is needed. The underlying infrastructure, including hardware and network, must also be designed to be fault tolerant, and proper change management and recovery processes must be in place to minimize downtime. Applications and systems must be monitored so that key failure points and risk areas are known. Finally, planning must coordinate across multiple teams to ensure success.

Resolve Actions Pro has been designed from the ground up as an enterprise-class software solution that provides the highest levels of uptime and failover, with capabilities such as server clustering, gateway failover, and site-to-site failover, to name a few. This document describes how the Actions Pro solution, together with proper architecture, IT infrastructure components, and processes, can help maximize availability.

Key Elements to Maximize Availability

Fault Tolerance

Software and hardware inevitably fail for a variety of reasons, including device failures, misconfigurations, natural disasters, and software bugs. Redundancy and due diligence are the best tools to minimize these problems. The following is a list of possible causes of down time and best practices to proactively address them.

Actions Pro Software Failure

Numerous factors could impact the stability of Actions Pro. To minimize down time caused by failures from Actions Pro instances:

Ensure that the Actions Pro server clustering capabilities are properly deployed and configured.
Deploy a secondary/failover Gateway so that it can take over the load if the primary Gateway becomes unresponsive.
If users are not able to access the Actions Pro system, ensure that the HTTP load balancer is properly deployed and configured.
Continuously monitor Actions Pro logs and processes to ensure that no errors are occurring.
Continuously monitor application health, using Actions Pro reports or other monitoring tools, to ensure that key metrics such as JVM heap memory usage, system load, and thread counts stay within the normal range.
Test and validate the health of individual Actions Pro components such as Runbooks, Gateways, Notifications, etc.
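
The "stay within the normal range" check above can be sketched as a simple comparison against configured limits. This is a minimal illustration only: the metric names and limits below are hypothetical, and how the values are collected (Actions Pro reports or another monitoring tool) is left to the deployment.

```python
# Hypothetical "normal range" limits; real values depend on the deployment.
NORMAL_RANGES = {
    "jvm_heap_used_pct": (0.0, 80.0),
    "system_load_per_cpu": (0.0, 1.0),
    "thread_count": (0, 500),
}


def out_of_range(metrics, ranges=NORMAL_RANGES):
    """Return the names of metrics that are missing or outside their range."""
    problems = []
    for name, (low, high) in ranges.items():
        value = metrics.get(name)
        if value is None or not (low <= value <= high):
            problems.append(name)
    return problems


sample = {"jvm_heap_used_pct": 91.5, "system_load_per_cpu": 0.4, "thread_count": 220}
print(out_of_range(sample))  # ['jvm_heap_used_pct']
```

A missing metric is reported as a problem on purpose, since a monitor that stops reporting is itself a warning sign.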

Database Failure

Although the database is a separate application that is not specifically part of the Actions Pro software, it stores important Actions Pro data such as Wiki content and Automations. To minimize down time caused by database failure:

Ensure that the database service is set up to be highly available, e.g. Oracle RAC or other configurations that enable the backup database server to take over the load quickly when the primary database fails.
Ensure that the database is monitored for signs of stress or failure.
Ensure sufficient space is allocated to store data generated from Actions Pro. Monitor the disk space usage to ensure sufficient space remains and plan for additional space ahead of time.
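
Planning for additional space "ahead of time" usually means projecting when the current allocation will run out. The sketch below assumes linear growth, which is a simplification; the capacity, usage, and growth figures are illustrative and would come from the database's own monitoring.

```python
def days_until_full(capacity_gb, used_gb, daily_growth_gb):
    """Estimate days of headroom left, assuming linear growth.

    Returns None when usage is flat or shrinking (no projected exhaustion).
    """
    if daily_growth_gb <= 0:
        return None
    remaining_gb = max(capacity_gb - used_gb, 0)
    return remaining_gb / daily_growth_gb


print(days_until_full(capacity_gb=500, used_gb=380, daily_growth_gb=4))  # 30.0
```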

Elasticsearch Failure

Actions Pro uses Elasticsearch as the internal database to store information such as Worksheet and reporting related data, in addition to other functions such as search and collaboration. Actions Pro releases that use Elasticsearch version 1.7 are prone to instability when the number of indices grows to more than 500. To minimize down time caused by Elasticsearch failure:

Set up archiving to periodically move data from Elasticsearch to the SQL database, thereby reducing the number of indices in Elasticsearch.
- The archive settings are controlled by the rscontrol.archive.* settings in blueprint.properties.
Archiving should be scheduled to run during off-peak hours.
- The initial archive scheduled execution time is set by the rscontrol.archive.schedule setting in the blueprint. This will set up a Scheduled Job called WORKSHEET_ARCHIVE. Once that job is set up, the time can be changed by editing the job from the Scheduled Job menu in the UI.
Archiving should be tuned to strike a balance between performance and load on the system.
- The key settings for performance tuning are rscontrol.archive.blocksize, rscontrol.archive.smalltableblocksize, rscontrol.archive.sleeptime, rscontrol.archive.blocksleeptime, and rscontrol.archive.smalltableblocksleeptime.
- The archive moves data from Elasticsearch to the database in blocks; the block size settings control how many records are moved each time. The smaller the block size, the less CPU, memory, and I/O each block consumes, but the longer the total archive time. The block sleep time settings control how long the system waits between blocks, and sleeptime controls how long it waits after finishing one set of daily indices (i.e. worksheet or process request). These pauses allow any automations that were held up by the increased load to catch up.
- The default settings assume that the system will be under light load during the archive period. If the system is under medium or heavy load (which can be determined from the Admin System reports), the block size settings should be lowered by 1/2 to 1/4, depending on how heavy the average load is, and the sleep times increased.
- If archiving is being turned on for the first time after a long period of usage, expect the first archive to take up much more time and resources than subsequent archives, as the first will have a lot more data to move.
Monitor the hard disk to ensure that it has sufficient space.
- Elasticsearch will stop allocating new shards to a node when 85% of the disk space is used, and will start actively deallocating shards at 90%. Either condition will cause execution problems for Actions Pro, so alarms should be set up to notify system administrators to allocate more space or clean up disk space before these points are reached.
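
The block-size and sleep-time trade-off described in the archiving items above can be made concrete with a rough estimate. The model below is an assumption, not the archiver's actual behavior: it treats processing time as proportional to the record count and adds one pause between consecutive blocks. It does not read the rscontrol.archive.* settings, the source does not state the unit of the sleep settings, and every number shown is hypothetical.

```python
import math


def estimate_archive_time(records, block_size, per_record_cost, block_sleep):
    """Rough archive duration under a simplified model.

    Assumes processing time is proportional to the record count and that the
    archiver pauses once between consecutive blocks. All times share one unit.
    """
    blocks = math.ceil(records / block_size)
    work_time = records * per_record_cost
    pause_time = max(blocks - 1, 0) * block_sleep
    return blocks, work_time + pause_time


# Halving the block size doubles the number of pauses for the same data.
for size in (1000, 500, 250):
    blocks, total = estimate_archive_time(
        records=100_000, block_size=size, per_record_cost=0.5, block_sleep=2
    )
    print(size, blocks, total)
```

Under this model the total work stays constant while the pause time grows as blocks shrink, which matches the guidance that smaller blocks lower the load per block but lengthen the overall archive run.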

Storage Failure

Actions Pro data storage, especially the Elasticsearch cluster, may require a Storage Area Network or NAS system. To minimize down time caused by storage failure:

Monitor the health of the storage system to ensure its availability.
Monitor the disk space usage to ensure sufficient space remains and plan for additional space ahead of time.
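
For hosts that store Elasticsearch data, disk monitoring can be tied to the 85% and 90% thresholds given in the Elasticsearch Failure section. The sketch below is a minimal example under stated assumptions: the early-warning margin is hypothetical, the status labels are invented for illustration, and the used/total ratio from the operating system only approximates what Elasticsearch itself measures.

```python
import shutil

LOW_WATERMARK_PCT = 85.0   # new shards stop being allocated (per the Elasticsearch Failure section)
HIGH_WATERMARK_PCT = 90.0  # shards start being deallocated (per the same section)
ALARM_MARGIN_PCT = 10.0    # hypothetical early-warning margin


def disk_status(used_pct, margin=ALARM_MARGIN_PCT):
    """Classify disk usage relative to the two thresholds."""
    if used_pct >= HIGH_WATERMARK_PCT:
        return "critical"
    if used_pct >= LOW_WATERMARK_PCT:
        return "blocked"
    if used_pct >= LOW_WATERMARK_PCT - margin:
        return "warning"
    return "ok"


def used_percent(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


print(disk_status(used_percent("/")))
```

Raising the alarm at the "warning" level leaves time to add or free space before either threshold is reached.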

Hardware Failure

Hardware, such as the physical hosts that run Actions Pro, can suffer failures of its own, including power supply, hard drive, and motherboard failures. To minimize down time caused by hardware failure:

Ensure that an Actions Pro cluster configuration is deployed. Additional nodes increase resiliency and capacity. If one node goes down due to hardware failure, the other nodes can take over the load with minimum interruption to the users.
Continuously monitor the state of the servers and respond as quickly as possible if any node fails.
Use hardened, datacenter-class hardware with hot-swappable components and built-in redundancy.
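
Whether the other nodes can actually take over the load depends on how much spare capacity the cluster has. A simple way to reason about it is an N-1 check, sketched below. The capacity and load units are arbitrary and the figures are illustrative; real values would come from load testing or the system reports.

```python
def survives_node_loss(node_count, per_node_capacity, peak_load, failures=1):
    """True if the remaining nodes can carry the peak load after `failures` nodes go down."""
    remaining = node_count - failures
    if remaining <= 0:
        return False
    return remaining * per_node_capacity >= peak_load


print(survives_node_loss(node_count=3, per_node_capacity=400, peak_load=700))  # True
print(survives_node_loss(node_count=2, per_node_capacity=400, peak_load=700))  # False
```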

Site Failure

Natural disasters or major incidents such as a region-wide power outage can render an entire site inoperable. Actions Pro supports a site-to-site failover configuration and can be set up to fail over to a secondary site in the case of a disaster. To minimize down time caused by site failure:

Establish a secondary site in a different geography to act as a Disaster Recovery cluster. The Disaster Recovery cluster should have the same capacity as the primary cluster.
Configure the Disaster Recovery cluster with the full resources and dependencies needed by Actions Pro to ensure that this cluster can function independently.
Periodically practice switchover/failback exercises to ensure that the Disaster Recovery site can take over the load as quickly as possible if necessary.

3rd Party Application Failure

Actions Pro depends on 3rd party applications for tasks such as receiving tickets. If the 3rd party application is down while Actions Pro itself is functional, Actions Pro will not be able to process the data from the 3rd party application. To minimize down time caused by failures from 3rd party applications:

Ensure that the 3rd party applications are set up to be highly available to increase resilience.
Ensure that the 3rd party applications are set up in the Disaster Recovery site as well.
Monitor the 3rd party applications and actively test their availability and response time. Intervene as soon as possible when signs of stress appear.
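
Actively testing availability and response time can be as simple as timing a probe call. The sketch below is generic: `check` is any caller-supplied callable (for example, a request to the ticketing system's API), and the status labels and the two-second threshold are assumptions made for illustration.

```python
import time


def probe(check, slow_after_seconds=2.0):
    """Run `check()` and classify the dependency as 'ok', 'slow', or 'down'.

    `check` is any caller-supplied callable that raises on failure.
    """
    started = time.perf_counter()
    try:
        check()
    except Exception as exc:
        return "down", time.perf_counter() - started, repr(exc)
    elapsed = time.perf_counter() - started
    status = "slow" if elapsed > slow_after_seconds else "ok"
    return status, elapsed, None


def failing_check():
    # Stand-in for a real probe that cannot reach the application.
    raise ConnectionError("ticketing system unreachable")


print(probe(lambda: None)[0])   # ok
print(probe(failing_check)[0])  # down
```

Treating "slow" as a separate state allows intervention while the application is still responding.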

Network Failure

Network connectivity issues can prevent Actions Pro from functioning properly and could also lead to false diagnostics. To minimize down time caused by network failures:

Ensure that there are multiple network paths where possible.
Actively monitor network connectivity, load, and latency.
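
A basic connectivity and latency probe can be written with the standard socket library. This sketch only measures TCP connection setup time; it does not measure throughput or load, and the demonstration connects to a throwaway local listener so that it runs without external network access. Real targets would be the hosts and ports that Actions Pro depends on.

```python
import socket
import time


def tcp_check(host, port, timeout=3.0):
    """Attempt a TCP connection; return (reachable, seconds_taken)."""
    started = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        reachable = False
    return reachable, time.perf_counter() - started


# Demonstration against a temporary listener on the loopback interface.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
print(tcp_check("127.0.0.1", listener.getsockname()[1]))
listener.close()
```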

User Security

If users cannot log in because the Directory Service is unavailable, Actions Pro is effectively unavailable to them. To minimize down time caused by User Security issues:

Ensure that the Directory Service is highly available.
Regularly monitor and test the Directory Service for availability and response time.
Regularly test user authentication to ensure the service is functional.

Change Management

Changes to software versions and to the configuration of operating systems, network devices, security, firewalls, etc. can cause problems for workflows and automations and can be a major factor affecting availability. A good Change Management process is critical for reducing down time caused by operations or human error.

Actions Pro Configuration

To minimize down time caused by Actions Pro software configuration issues:

Ensure proper change processes and controls are in place so that configuration changes are planned, tested in a non-production environment, approved, coordinated, and validated.
Ensure configuration is replicated to the DR site.
Periodically review performance data, such as the number of events per second and the 15-minute load average of the various Actions Pro components, to ensure that the Actions Pro configuration is still adequate.

Server/JVM Configuration

Ensure proper change processes and controls are in place so that changes to Servers/JVM are planned, tested in a non-production environment, approved, coordinated, and validated.
Periodically review configurations to ensure that they are still adequate for the current setup and load.
Periodically review server access control to ensure that appropriate team members have proper access to the servers for monitoring and administration purposes.

Network, Security and Firewall Configuration

Ensure proper change processes and controls are in place so that changes to the network or firewall are planned, tested in a non-production environment, approved, coordinated, and validated. To minimize the risk of change, make incremental changes and validate each change before proceeding to the next.
Periodically review the Actions Pro port requirements to ensure that the ports needed by Actions Pro are open and available.
Periodically review SSL certificates to ensure that they are renewed before they expire, since expired certificates lead to login problems.
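
The certificate review can be partly automated by computing how many days remain before a certificate's expiry date. The sketch below uses the standard library's `ssl.cert_time_to_seconds`, which parses the `notAfter` format returned by `ssl.SSLSocket.getpeercert()`. The dates shown are made up, and retrieving the certificate from a live server is left out.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate's notAfter timestamp.

    `not_after` uses the format returned by ssl.SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2030 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400.0


# Fixed reference time so the example is reproducible.
reference = ssl.cert_time_to_seconds("Jan  1 00:00:00 2030 GMT")
print(days_until_expiry("Jan 31 00:00:00 2030 GMT", now=reference))  # 30.0
```

A negative result means the certificate has already expired.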

Software Updates or Upgrades

Patch, Hotfix, Minor Update

A patch, hotfix, or minor update does not require Actions Pro to be completely shut down prior to maintenance. For maximum availability:

Ensure that the Actions Pro software is properly deployed and configured for clustering. Use rolling updates (if permitted) to incrementally update components in the cluster.
Ensure that the updated component behaves and functions properly before moving to the next target.
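
The "update one component, verify it, then move to the next" pattern can be sketched as a loop that stops at the first unhealthy result. The `update` and `is_healthy` callables are hypothetical hooks supplied by the caller; they are not Actions Pro APIs, and the node names are placeholders.

```python
def rolling_update(nodes, update, is_healthy):
    """Update nodes one at a time, halting at the first unhealthy result.

    `update` and `is_healthy` are caller-supplied callables (hypothetical
    hooks, not Actions Pro APIs). Returns (updated_nodes, failed_node).
    """
    updated = []
    for node in nodes:
        update(node)
        if not is_healthy(node):
            return updated, node
        updated.append(node)
    return updated, None


log = []
result = rolling_update(
    ["node1", "node2", "node3"],
    update=log.append,
    is_healthy=lambda node: node != "node2",
)
print(result)  # (['node1'], 'node2')
print(log)     # ['node1', 'node2']
```

Halting on the first failure keeps the remaining nodes on the previous version, so the cluster can continue serving load while the failed node is investigated.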

Major Upgrade

A major software upgrade requires that all Actions Pro core components (e.g. RSControl, RSView, RSSearch) are upgraded at the same time. The best way to achieve maximum availability is to have another Actions Pro cluster that can take over the load while the primary cluster is being upgraded. Preferably, a temporary / dual primary cluster setup offers the maximum availability. Alternatively, switching over to a Disaster Recovery (DR) site for major upgrades could largely achieve the same effect.
If a temporary / dual environment can be set up (see documentation for more details):

Clone the primary cluster to a temporary cluster.
Upgrade the temporary cluster and validate that it is fully operational.
Incrementally migrate the Gateway load over to the temporary cluster.
Migrate the users over to the temporary cluster.
With no load or users on the primary cluster, upgrade the primary cluster.
Once the primary cluster is fully functional, use switchover to migrate the load from the temporary cluster back to the primary cluster.
After validating the primary cluster is fully operational, the temporary cluster can be decommissioned.

If a disaster recovery site is available:

Switch over from the primary cluster to the Disaster Recovery cluster.
After confirming that a rollback will not be needed, upgrade the primary cluster.
Switch back to the primary cluster.
Upgrade the Disaster Recovery cluster.
Test failover/failback procedures to ensure that both sites are fully functional and Disaster Recovery capability is not compromised because of the upgrade.

Content Development and Deployment Process

Bugs or inefficiencies in new content can also affect the availability of Actions Pro. A required artifact that is missing from the deployment process also impacts availability. To minimize the down time from deployment of new content:

Ensure development, QA, and production clusters are properly deployed and have the same version.
Ensure that content is developed and fully tested on the development and QA clusters before it is deployed to a production cluster.
Exercise the deployment process on the QA cluster to iron out any potential issues.
Ensure that new content is load- and performance-tested if heavy load is expected.
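
Two of the checks above, matching versions across clusters and no missing artifacts, lend themselves to a simple pre-deployment gate. The sketch below is illustrative: the cluster names, version strings, and artifact names are placeholders, and how they are gathered from each environment is not covered.

```python
def deployment_issues(cluster_versions, required_artifacts, packaged_artifacts):
    """List problems that should block a content deployment."""
    issues = []
    if len(set(cluster_versions.values())) > 1:
        issues.append(f"version mismatch: {cluster_versions}")
    missing = sorted(set(required_artifacts) - set(packaged_artifacts))
    if missing:
        issues.append(f"missing artifacts: {missing}")
    return issues


print(deployment_issues(
    {"dev": "7.8", "qa": "7.8", "prod": "7.8"},
    required_artifacts=["runbook_a", "action_task_b"],
    packaged_artifacts=["runbook_a"],
))  # ["missing artifacts: ['action_task_b']"]
```

An empty list means neither check found a reason to block the deployment.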

Monitoring and Planning

Operating Systems Failure

To minimize down time caused by issues related to the underlying Operating System:

Ensure that the Operating System has been properly configured, e.g. with a sufficient process limit, so that Actions Pro can acquire the necessary resources.
If Virtual Machines are used, ensure that Actions Pro runs on dedicated, not shared, resources.
If there are other applications running on the same host, ensure they do not negatively impact Actions Pro performance.
Continuously monitor the hosts for signs of stress, e.g. load, memory utilization, swap memory, process count, etc.
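
On Unix-like hosts, the process-limit check mentioned above can be performed with the standard `resource` module. The minimum values below are hypothetical placeholders; the actual requirements should be taken from the Actions Pro installation documentation. The check must run as the same user that runs Actions Pro, because limits are per user session.

```python
import resource  # Unix-only standard library module


def limit_is_sufficient(soft_limit, required):
    """True if a soft resource limit is unlimited or at least `required`."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= required


# Hypothetical minimums; the real requirements come from the product documentation.
REQUIRED = {"RLIMIT_NPROC": 4096, "RLIMIT_NOFILE": 8192}

for name, minimum in REQUIRED.items():
    soft, _hard = resource.getrlimit(getattr(resource, name))
    print(name, soft, "ok" if limit_is_sufficient(soft, minimum) else "too low")
```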

Security and Intrusion Detection

Ensuring that only appropriate and authorized personnel have access to Actions Pro is also an important factor that affects availability. Any changes made to Actions Pro should be coordinated and approved. To mitigate the risk of security threats:

Ensure that proper intrusion detection and protection mechanisms are in place to detect and deter any unauthorized activities.
Periodically review audit logs to check for any suspicious activities.
Periodically change the Actions Pro administrator or maintenance account credentials.
Periodically review the privileges associated with different groups or roles.

Capacity Planning

Capacity planning is critical to ensure that Actions Pro has the resources it needs to do the job and that sufficient resources are allocated to meet future growth. To minimize the down time caused by insufficient resources:

Continuously monitor key metrics such as server load, transaction count and latency, page views and response time, database and Elasticsearch usage, etc.
Compare the performance and load data with the historical data to detect any significant shift in performance and to identify opportunities for improvement.
Anticipate future growth and provision additional resources well ahead of time to ensure that Actions Pro has sufficient resources to meet the growing demand.
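
One simple way to detect a "significant shift" against historical data is to measure how far the current value sits from the historical mean, in standard deviations. This is a sketch of one possible technique, not a prescribed method; the threshold of 3.0 and the latency samples are assumptions made for illustration.

```python
import statistics


def significant_shift(history, current, threshold=3.0):
    """Flag `current` if it is more than `threshold` standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    if spread == 0:
        return current != mean
    return abs(current - mean) / spread > threshold


latency_ms = [120, 118, 125, 122, 119, 121, 124]
print(significant_shift(latency_ms, 123))  # False
print(significant_shift(latency_ms, 180))  # True
```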

Conclusion

In summary, maximizing availability requires a holistic approach. The Actions Pro software has built-in capabilities for fault tolerance, redundancy, and site-to-site failover that can be used as a starting point. However, high availability also requires that the underlying infrastructure, including hardware and network, is designed to be fault tolerant and that proper change management and recovery processes are in place to minimize human error and downtime. Proper monitoring and adequate planning ensure that risk areas are attended to as quickly as possible and that sufficient resources are allocated to handle present and future needs. This is an ongoing process. Actions Pro capabilities make that process more efficient and less error-prone, and following the best practices in this document will enable organizations to maximize the uptime and availability of the Actions Pro system.
